Document Type : Original Article
Authors
1 Director of Ilam Petrochemical Company, Ilam, Iran
2 Department of Chemistry, Payame Noor University, P.O. BOX 19395-4697, Tehran, Iran
Abstract
Water pollution is a major global problem that requires ongoing evaluation and revision of water resource policy at all levels (from international down to individual aquifers and wells). It has been suggested that it is the leading worldwide cause of death and disease, and that it accounts for the deaths of more than 14,000 people daily. Genetic algorithm-partial least squares (GA-PLS), genetic algorithm-kernel partial least squares (GA-KPLS) and Levenberg-Marquardt artificial neural network (L-M ANN) techniques were used to investigate the correlation between retention time (RT) and molecular descriptors for 150 organic contaminants in natural water and wastewater, whose retention times were obtained by gas chromatography coupled to high-resolution time-of-flight mass spectrometry (GC-TOF MS). The L-M ANN model performed significantly better than the other models. This indicates that L-M ANN can be used as an alternative modeling tool for quantitative structure–retention relationship (QSRR) studies.
Keywords
- Water Pollution
- Hazardous chemicals
- Organic pollutants
- Gas Chromatography
- Time-of-flight mass spectrometry
- Chemometrics
- Levenberg-Marquardt artificial neural network
Introduction
Water pollution is the contamination of water bodies, including lakes, rivers, oceans, and groundwater. It occurs when pollutants are discharged directly or indirectly into water bodies without adequate treatment to remove harmful compounds. Water pollution affects the plants and organisms living in these water bodies. In almost all cases, it not only harms individual species and populations but also damages natural biological communities [1,2].
An estimated 700 million Indians have no access to a proper toilet, and 1,000 Indian children die of diarrheal sickness every day. 90% of cities in China suffer from water pollution, and around 500 million people do not have access to safe drinking water [2,3]. In addition to the acute problems of water pollution in developing countries, developed countries struggle with pollution problems as well. In the most recent national report on water quality in the United States, 45% of the assessed stream miles, 47% of the assessed lake acres, and 32% of the assessed bay and estuarine square miles were classified as polluted [3].
Natural phenomena such as volcanoes, algae blooms, storms, and earthquakes also cause major changes in water quality and the ecological status of water. Surface water and groundwater have often been studied and managed as separate resources, although they are interrelated. Surface water seeps through the soil and becomes groundwater. Conversely, groundwater can also feed surface water sources. Sources of surface water pollution are generally grouped into two categories based on their origin [4].
Contaminants in water may include organic and inorganic substances. Some organic water pollutants are:
- Insecticides and herbicides, a huge range of organohalide and other chemicals
- Bacteria, often from sewage or livestock operations
- Food processing waste, including pathogens
- Tree and brush debris from logging operations
- VOCs (volatile organic compounds, industrial solvents) from improper storage [5, 6]
Some inorganic water pollutants include:
- Heavy metals, including acid mine drainage
- Acidity caused by industrial discharges (especially sulfur dioxide from power plants)
- Chemical waste as industrial by-products
- Fertilizers in runoff from agriculture, including nitrates and phosphates
- Silt in surface runoff from construction sites, logging, slash-and-burn practices or land clearing sites [7]
Organic pollution occurs when an excess of organic matter, such as manure or sewage, enters the water. When organic pollution increases in a pond, the number of decomposers increases. The decomposers consume dissolved oxygen as they grow, and the resulting oxygen depletion can kill aquatic organisms. As the aquatic organisms die, they are broken down by decomposers, leading to further depletion of the oxygen. A type of organic pollution can also occur when inorganic pollutants such as nitrogen and phosphates accumulate in aquatic ecosystems. High levels of these nutrients cause an overgrowth of plants and algae. As the plants and algae die, they become organic material in the water. The enormous decay of this plant matter, in turn, lowers the oxygen level. The process of rapid plant growth followed by increased activity by decomposers and a depletion of the oxygen level is called eutrophication [5, 6].
There are several important reasons why social scientists should examine the causes of organic water pollution. First, it is largely the result of human activities. The industrial activities that contribute to organic water pollution include manufacturing of glass, pesticides, medicines, plastics, ceramics, textiles, metals, and paper [8]. Some other activities that contribute to water pollution include food processing facilities with inadequate disposal facilities and the dispersing of water used to cool coke during steel production. The chemicals and byproducts of these manufacturing and industrial processes often end up as waste and are disposed of by being dumped into rivers, lakes, and streams [9].
Second, water pollution has been associated with many other environmental problems. For instance, many chemicals that are dumped into waterways are not only highly toxic but also take a long time to decompose. Consequently, there is a shift in the pH of the water. The pH shift causes certain plants and animals to die off while allowing others to reproduce unchecked, thereby reducing biodiversity. Some water pollutants also stimulate oxygen consumption by plants, algae, and bacteria. This process reduces the level of dissolved oxygen, creating a situation of chronic “stress” that lowers the body weight of aquatic animals and makes them less able to compete for food and habitat. It also creates conditions that are lethal to some fish and aquatic invertebrates, which die from lack of oxygen [10].
Third, water pollution from industrial and manufacturing activity has serious health effects in humans [11, 12]. The toxic chemicals found in water supplies affect people through the process of “bioaccumulation” or the building up of toxins in the fatty tissue of mammals. The long-term effects of bioaccumulation in adults include cancer, blood disorders, immunity suppression, and spontaneous abortions. The buildup of these pollutants has been linked to birth defects.
The United States Environmental Protection Agency (EPA) monitors and analyzes organic pollutants in water. The EPA has established a list of a "dirty dozen" particularly widespread and persistent organic pollutants (POPs). Part of the EPA's mandate is to identify where these pollutants occur in water resources and to contain or mitigate POPs.
The POPs include intentionally produced chemicals such as pesticides as well as industry or combustion by-products. The dirty dozen are aldrin, chlordane, DDT, dieldrin, endrin, heptachlor, hexachlorobenzene, mirex, toxaphene, PCBs, dioxins and furans.
EPA laboratories in Cincinnati, Ohio and Athens, Georgia investigated analytical methods to analyze organic pollutants in water based on gas chromatography separation of the pollutants and mass spectrometer identification and quantification. The research results were published as the EPA's test methods 624 and 625 for the standard analysis of organic pollutants in municipal and industrial effluent [13].
Gas chromatography separates organic pollutants for further analysis. The researcher injects a sample into the gas chromatography instrument. The instrument heats the sample to a gas and injects it into the gas chromatography tube or column. As the sample travels the length of the column, the different organic molecules repeatedly condense into a liquid and then vaporize again. As a liquid, the molecules stick to the column, but as a gas, they travel through the column quickly. Different pollutants have different ratios of gas to liquid, so each travels through the column at a different speed [14]. The separated pollutants are then analyzed by mass spectrometry. A mass spectrometer ionizes a sample and shoots it through an electric field. The electric field bends the path (trajectory) of lighter molecules more than that of heavier molecules. The sample strikes a detector at a position determined by its mass. This method identifies and quantifies organic pollutants in water after they have been separated by gas chromatography. The combination of gas chromatography and mass spectrometry gives researchers complete information on the types of organic pollutants in a sample and the concentration of each pollutant in the sample.
Most of these methods focus on target analysis for quantitative purposes, and their scope rarely exceeds several tens of analytes; analytical methods for the determination of more than 100 organic pollutants are quite unusual. In the last decade there has been a notable increase in the use of full-spectrum acquisition techniques, such as time-of-flight mass spectrometry (TOF MS), which acquire a huge amount of chemical information on the sample in a single analysis [15, 16]. This widens the number of analytes that can be searched in a single experiment, with the additional advantage that the data can be re-examined at any time to search for other compounds not included in the first screening, without the need for additional analysis. TOF MS and hybrid quadrupole-TOF MS have been successfully applied for screening purposes in combination with gas chromatography (GC) or liquid chromatography (LC) in different applied fields, such as environmental analysis, food safety and toxicology. The TOF analyzer provides the selectivity and sensitivity required for wide-scope screening, as it combines high full-spectral sensitivity with high mass resolution. The accurate mass data obtained can be processed in a “post-target” and/or non-target way, which makes the instrument highly versatile and allows the user to tackle an analytical problem in different ways, depending on the aim of the analysis [15-17].
Prediction of the physicochemical properties of materials from their molecular structure has long been a goal of scientists and engineers. One of the best methods applied for this purpose is the quantitative structure–retention relationship (QSRR). QSRR analysis is now a well-established and highly respected technique for correlating the chromatographic retention time of a compound with its molecular structure through a variety of descriptors. The basic strategy of QSRR analysis is to find optimum quantitative relationships, which can then be used to predict retention from molecular structure [18, 19]. Once a reliable relationship has been obtained, it can be used to predict the retention of other structures not yet measured or even not yet prepared. QSRR studies of retention time have been reported for different types of organic compounds [20-22].
The application of this technique usually requires variable selection for building well-fitted models. Nowadays, the genetic algorithm method (GA) is well known as an interesting and more widely used variable selection method. GA is a stochastic method that solves the optimization problems defined by fitness criteria, applying the evolution hypothesis of Darwin and different genetic functions, i.e. crossover and mutation [23, 24].
In this work, we aim to construct a QSRR model relating the retention time of organic contaminants in natural water and wastewater to their theoretically derived descriptors. After the variables were selected, linear multivariate regression (partial least squares (PLS)) and non-linear regression (kernel PLS (KPLS) and the Levenberg-Marquardt artificial neural network (L-M ANN)) were used to construct the linear and non-linear QSRR models. The sets of variables that provide the best-fitted models for the PLS and KPLS methods were selected with the help of the genetic algorithm.
Materials and methods
Equipment
A Pentium IV personal computer (CPU at 3.06 GHz) with the Windows XP operating system was used. The geometry optimization was performed with HyperChem (Version 7.0, Hypercube, Inc.). The Dragon 2.1 software was used to calculate the molecular descriptors. The GA-PLS, GA-KPLS, L-M ANN, cross-validation and other calculations were performed in MATLAB (Version 7.0, MathWorks, Inc.).
Data set and descriptor generation
The data set used in this study consists of the retention times (RT) of 150 organic contaminants in natural water and wastewater, obtained by gas chromatography time-of-flight mass spectrometry (GC-TOF MS) and taken from the literature [25]; it is shown in Table 1 and Table 2. The organic pollutants in natural water and wastewater include PAHs, octyl/nonyl phenols, PCBs, PBDEs and a notable number of pesticides, such as insecticides (organochlorines, organophosphorus, carbamates and pyrethroids), herbicides (triazines and chloroacetanilides), fungicides and several relevant metabolites. Water samples of different types and origins were collected from different sites in Castellón province (Spain): two surface water (SW) samples (Villarreal and Burriana), two groundwater (GW) samples (Almassora and Castellón), and two effluent wastewater (EWW) samples from a wastewater treatment plant (WWTP) in Castellón. The chemical structures of the 150 studied molecules were drawn with the HyperChem software and saved with the HIN extension. The geometry of each molecule was optimized with the AM1 method. The DRAGON software was then used to calculate a total of 1497 molecular descriptors, belonging to 18 different types of theoretical descriptors, for each molecule.
Instrumentation
GC instrumentation consisted of an Agilent 6890N GC system (Palo Alto, CA, USA), equipped with an Agilent 7683 autosampler, coupled to a time-of-flight mass spectrometer, GCT (Waters Corporation, Manchester, UK), operating in electron ionization (EI) mode. The GC separation was performed using a fused silica HP-5MS capillary column of 30 m × 0.25 mm i.d. and a film thickness of 0.25 µm (J&W Scientific, Folsom, CA, USA). The oven temperature was programmed as follows: 90 °C (1 min); 5 °C/min to 300 °C (2 min). Splitless injections of 1 µL of sample were carried out. Helium was used as carrier gas at 1 mL/min. The interface and source temperatures were both set to 250 °C and a solvent delay of 3 min was selected. The TOF MS was operated at 1 spectrum/s, acquiring the mass range m/z 50–650 and using a multi-channel plate voltage of 2800 V. TOF MS resolution was about 8500 (FWHM) at m/z 614. Heptacosa, used for the daily mass calibration as well as the lock mass, was injected via syringe into the reference reservoir at 30 °C. The m/z ion monitored was 218.9856. The application manager TargetLynx, a module of the MassLynx software, was used to process data obtained from standards and samples for target compounds. The application manager ChromaLynx, also a module of the MassLynx software, was used to investigate the presence of non-target compounds in samples. Library searching was performed using the commercial NIST library.
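The oven program fixes the total run time, which in turn bounds the retention times that can be observed. A minimal sketch of that arithmetic, using only the values stated above (the function name is illustrative, not part of any instrument software):

```python
def oven_run_time(initial_c, initial_hold_min, ramp_c_per_min, final_c, final_hold_min):
    """Total run time in minutes of a single-ramp GC oven program."""
    ramp_min = (final_c - initial_c) / ramp_c_per_min  # time spent ramping
    return initial_hold_min + ramp_min + final_hold_min


# 90 °C held for 1 min, then 5 °C/min to 300 °C, held for 2 min
total = oven_run_time(90, 1, 5, 300, 2)
print(total)  # 45.0
```

Assuming the retention times in Table 1 are reported in minutes, every analyte should elute within this 45 min window, which is consistent with the largest value listed there (42.07).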
Data pretreatment
To decrease the redundancy in the descriptor data matrix, descriptors that contribute no information, or whose information content is redundant with that of other descriptors in the pool, were eliminated. The remaining descriptors were collected in an n × m data matrix (D), where n = 150 and m = 1019 are the numbers of compounds and descriptors, respectively. These descriptors were used to generate the models with the GA-PLS and GA-KPLS programs.
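The pretreatment step can be sketched as follows. The paper does not state the elimination criteria, so the rules here (drop near-constant columns, then drop any column whose absolute Pearson correlation with an already-kept column exceeds a cutoff) and the names `prune_descriptors`, `corr_threshold` and `var_tol` are assumptions for illustration only.

```python
import numpy as np


def prune_descriptors(D, corr_threshold=0.95, var_tol=1e-12):
    """Return the indices of the descriptor columns kept after pruning.

    D is an (n_compounds, n_descriptors) array. Columns with (near-)zero
    variance are dropped first; then, scanning left to right, a column is
    dropped if its absolute correlation with an already-kept column exceeds
    corr_threshold (an assumed cutoff, not taken from the paper).
    """
    D = np.asarray(D, dtype=float)
    informative = np.flatnonzero(D.var(axis=0) > var_tol)
    if informative.size == 0:
        return informative
    corr = np.atleast_2d(np.abs(np.corrcoef(D[:, informative], rowvar=False)))
    kept = []
    for j in range(informative.size):
        if all(corr[j, k] <= corr_threshold for k in kept):
            kept.append(j)
    return informative[kept]


# Toy matrix: column 1 duplicates column 0 linearly, column 2 is constant
x = np.arange(20.0)
D = np.column_stack([x, 2 * x + 1, np.ones(20), np.cos(x)])
print(prune_descriptors(D))
```

In this toy case only the first and last columns survive, since the second carries the same information as the first and the third carries none.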
Genetic algorithm
The genetic algorithm is a problem-solving method that uses genetic rules such as reproduction, crossover and mutation to build pseudo-organisms, which are then selected on the basis of a fitness criterion to survive and pass information on to the next generation [26]. GA uses a binary bit-string representation as the coding technique for a given problem; the presence or absence of a descriptor in a chromosome is coded by 1 or 0. A string is composed of several genes that represent a specific characteristic to be studied. In the present case, a string is composed of 561 genes representing the presence or absence of a descriptor. By encoding various descriptors with bit strings, called chromosomes, the initial population was created randomly. The population size was varied between 50 and 300 for different GA runs. For a typical run, the evolution of the generation was stopped when 90% of the generations had reached the same fitness [27, 28]. In this paper, the population size is 30 chromosomes, the probability of initial variable selection is 5/V (where V is the number of independent variables), crossover is multi-point with a probability of 0.5, mutation is multi-point with a probability of 0.01, and the number of evolution generations is 1000. For each set of data, 5000 runs were performed.
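The bit-string encoding, crossover and mutation described above can be illustrated with a small sketch. It is not the authors' MATLAB program: the fitness function (cross-validated RMSE of an ordinary least-squares fit, standing in for PLS), the truncation selection and the uniform crossover are all assumptions made to keep the example short and self-contained.

```python
import numpy as np


def fitness(mask, X, y, n_folds=5):
    """Negative cross-validated RMSE of a least-squares fit on the selected
    columns (an assumed stand-in for the PLS-based fitness of GA-PLS)."""
    if not mask.any():
        return -np.inf
    idx = np.arange(len(y))
    sq_err = 0.0
    for f in range(n_folds):
        test = idx % n_folds == f
        A = np.column_stack([np.ones((~test).sum()), X[~test][:, mask]])
        coef, *_ = np.linalg.lstsq(A, y[~test], rcond=None)
        B = np.column_stack([np.ones(test.sum()), X[test][:, mask]])
        sq_err += np.sum((B @ coef - y[test]) ** 2)
    return -np.sqrt(sq_err / len(y))


def ga_select(X, y, pop_size=30, n_gen=50, p_cross=0.5, p_mut=0.01, seed=0):
    """Return a boolean mask over the columns of X chosen by a simple GA."""
    rng = np.random.default_rng(seed)
    n_var = X.shape[1]
    # Each gene starts switched on with probability 5/V
    pop = rng.random((pop_size, n_var)) < min(1.0, 5.0 / n_var)
    for _ in range(n_gen):
        scores = np.array([fitness(c, X, y) for c in pop])
        pop = pop[np.argsort(scores)[::-1]]
        parents = pop[: pop_size // 2]            # truncation selection (assumed)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            swap = rng.random(n_var) < p_cross    # uniform crossover (assumed)
            child = np.where(swap, a, b)
            child ^= rng.random(n_var) < p_mut    # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    scores = np.array([fitness(c, X, y) for c in pop])
    return pop[np.argmax(scores)]


# Synthetic example: only columns 2 and 7 carry signal
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 12))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=60)
mask = ga_select(X, y, n_gen=20)
print(np.flatnonzero(mask))
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next; a production implementation would add a stopping rule such as the convergence criterion described above.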
Nonlinear model
Artificial neural network
A three-layer back-propagation artificial neural network (ANN) with a sigmoid transfer function was used in the investigation of feature sets. The descriptors from the training set were used for model generation, whereas the descriptors from the validation set were used to stop overtraining of the network, and the descriptors from the test set were used to verify the predictivity of the model. Before training the networks, the input and output values were normalized by auto-scaling all data [29, 30]. To compare the results, the same number of hidden layer nodes was used for the ANN models from all other feature sets of each database. The goal of training the network is to minimize the output errors by changing the weights between the layers.
(1)
In this equation, ΔW is the change in the weight factor for each network node, α is the momentum factor, and F is a weight update function, which indicates how the weights are changed during the learning process. The weights of the hidden layer were optimized using the Levenberg-Marquardt algorithm, a second-derivative optimization method [31].
Levenberg-Marquardt Algorithm
In Levenberg-Marquardt algorithm, the update function, Fn, was calculated using equations (2-4).
(2)
(3)
(4)
where g is the gradient, J is the Jacobian matrix that contains the first derivatives of the network errors with respect to the weights, and e is a vector of network errors. The parameter µ is multiplied by a factor (λ) whenever a step would result in an increased e, and is divided by λ when a step reduces e [32, 33].
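The damping rule for µ can be made concrete with a short sketch of a generic Levenberg-Marquardt loop. It assumes the textbook form of the step, with gradient g = JᵀE and update −(JᵀJ + µI)⁻¹g, which may differ in detail from the authors' equations; the function name `lm_fit` and the toy curve-fitting problem are illustrative only.

```python
import numpy as np


def lm_fit(residual, jacobian, w0, mu=1e-2, factor=10.0, n_iter=50):
    """Minimise 0.5 * ||e(w)||^2 with a basic Levenberg-Marquardt loop."""
    w = np.asarray(w0, dtype=float)
    e = residual(w)
    for _ in range(n_iter):
        J = jacobian(w)
        g = J.T @ e                            # gradient of 0.5 * ||e||^2
        H = J.T @ J + mu * np.eye(w.size)      # damped Gauss-Newton matrix
        step = -np.linalg.solve(H, g)
        e_new = residual(w + step)
        if e_new @ e_new < e @ e:              # error reduced: accept, divide mu
            w, e = w + step, e_new
            mu /= factor
        else:                                  # error increased: reject, multiply mu
            mu *= factor
    return w


# Toy problem: recover a and b in y = a * exp(b * x) from noise-free data
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)
residual = lambda w: w[0] * np.exp(w[1] * x) - y
jacobian = lambda w: np.column_stack([np.exp(w[1] * x), w[0] * x * np.exp(w[1] * x)])
w_fit = lm_fit(residual, jacobian, [1.0, 0.0])
print(w_fit)
```

With a small µ the step approaches a Gauss-Newton step, and with a large µ it approaches a short gradient-descent step, which is why µ is raised after a failed step and lowered after a successful one.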
Results and discussion
Linear model
Results of the GA-PLS model
The best model is selected on the basis of the highest squared correlation coefficient of leave-group-out cross-validation (R2) and the lowest root mean square error (RMSE) and relative error (RE) of prediction. These parameters are probably the most popular measures of how well a model fits the data. The best GA-PLS model contains sixteen selected descriptors in a space of seven latent variables. These descriptors comprise a constitutional descriptor (mean electrotopological state (Ms)), 2D autocorrelations (Broto-Moreau autocorrelation of a topological structure - lag 5 / weighted by atomic masses (ATS5m), Broto-Moreau autocorrelation of a topological structure - lag 5 / weighted by atomic van der Waals volumes (ATS5v), Broto-Moreau autocorrelation of a topological structure - lag 5 / weighted by atomic Sanderson electronegativities (ATS5e), Moran autocorrelation - lag 3 / weighted by atomic polarizabilities (MATS3p), Geary autocorrelation - lag 5 / weighted by atomic Sanderson electronegativities (GATS5e) and Geary autocorrelation - lag 5 / weighted by atomic polarizabilities (GATS5p)), geometrical descriptors (spherosity (SPH) and absolute eigenvalue sum on geometry matrix (SEig)), 3D-MoRSE descriptors (3D-MoRSE - signal 11 / weighted by atomic masses (Mor11m), 3D-MoRSE - signal 29 / weighted by atomic masses (Mor29m) and 3D-MoRSE - signal 12 / weighted by atomic van der Waals volumes (Mor12v)), a GETAWAY descriptor (leverage-weighted autocorrelation of lag 5 / unweighted (HATS5u)), atom-centred fragments (CR3X (C-011) and R--CX..X (C-035)) and a charge descriptor (total negative charge (Qneg)). The R2 and RMSE for the training and test sets were (0.809, 0.740) and (0.599, 1.055), respectively. The predicted values of RT are plotted against the experimental values for the training and test sets in Figure 1. In general, the number of components (latent variables) in PLS analysis is smaller than the number of independent variables.
The PLS model uses a larger number of descriptors, which allows it to extract more structural information from the descriptors and results in a lower prediction error.
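The three figures of merit can be computed as in the sketch below. The paper does not give their formulas, so the definitions used here (squared Pearson correlation, root mean square error, and percentage relative error per compound) are assumed conventional ones, and the function names are illustrative. The sample values are the first three calibration compounds of Table 1.

```python
import numpy as np


def r_squared(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2


def rmse(y_obs, y_pred):
    """Root mean square error of prediction."""
    diff = np.asarray(y_pred, float) - np.asarray(y_obs, float)
    return np.sqrt(np.mean(diff ** 2))


def relative_error_percent(y_obs, y_pred):
    """Absolute relative error of each prediction, in percent."""
    y_obs = np.asarray(y_obs, float)
    return 100.0 * np.abs(np.asarray(y_pred, float) - y_obs) / y_obs


# Dichlorvos, Mevinphos and Acenaphthene (experimental vs calculated RT)
rt_exp = [7.85, 12.08, 13.25]
rt_cal = [7.41, 12.45, 12.06]
print(r_squared(rt_exp, rt_cal), rmse(rt_exp, rt_cal))
print(relative_error_percent(rt_exp, rt_cal))
```

Applied to a whole training or test set, the same functions give set-level values that can be compared across the GA-PLS, GA-KPLS and L-M ANN models.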
Figure 1. Plots of predicted retention time against the experimental values by GA-PLS model
Nonlinear model
Results of the GA-KPLS model
PLS is useful when the number of explanatory variables (N) exceeds the number of observations (n) and/or when a high level of multicollinearity exists among those variables. This situation (N ≥ n) is common in chemometrics and gave rise to the modification of classical principal component analysis (PCA) and linear PLS methods into their kernel variants. Those variants, however, aimed to reduce computational complexity in the input space rather than to apply a nonlinear transformation into a feature space of arbitrary dimensionality. Kernel PLS, in contrast, constructs nonlinear regression models in possibly high-dimensional feature spaces.
In this paper, a radial basis kernel function, $k(x, y) = \exp(-\lVert x - y \rVert^2 / c)$, was selected as the kernel function, with $c = r m \sigma^2$, where r is a constant that can be determined by considering the process to be predicted (here r was set to 1), m is the dimension of the input space and $\sigma^2$ is the variance of the data [34]. This means that the value of c depends on the system under study. The GA-KPLS feature selection method chose 13 descriptors in a space of 5 latent variables. These descriptors comprise topological descriptors (Schultz Molecular Topological Index (MTI) (SMTI), Harary H index (Har), average eccentricity (AECC) and eccentric connectivity index (CSI)), 2D autocorrelations (Broto-Moreau autocorrelation of a topological structure - lag 5 / weighted by atomic van der Waals volumes (ATS5v) and Moran autocorrelation - lag 3 / weighted by atomic polarizabilities (MATS3p)), a Burden eigenvalue (lowest eigenvalue n. 1 of Burden matrix / weighted by atomic Sanderson electronegativities (BELe1)), a geometrical descriptor (average span R (SPAM)), 3D-MoRSE descriptors (3D-MoRSE - signal 03 / weighted by atomic masses (Mor03m), 3D-MoRSE - signal 19 / weighted by atomic masses (Mor19m), 3D-MoRSE - signal 23 / weighted by atomic masses (Mor23m) and 3D-MoRSE - signal 17 / weighted by atomic van der Waals volumes (Mor17v)) and a molecular property (squared Moriguchi octanol-water partition coefficient (logP^2) (MLOGP2)). The R2 and RMSE for the training and test sets were (0.781, 0.716) and (0.649, 1.293), respectively. Figure 2 shows the plot of the GA-KPLS predicted versus experimental values of RT for all of the molecules in the data set. These results show that the statistical parameters of the GA-PLS model are superior to those of the GA-KPLS model.
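A radial basis kernel matrix of the kind used in kernel PLS can be computed as in the sketch below. It assumes the usual Gaussian form with a negative exponent and a width c equal to r times the input dimension times the data variance; how the variance is pooled (here over all entries of the training matrix) is an assumption, and `rbf_kernel_matrix` is an illustrative name.

```python
import numpy as np


def rbf_kernel_matrix(X, Y=None, r=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / c) between rows of X and Y.

    The width is c = r * m * sigma^2, with m the number of input variables
    and sigma^2 the variance pooled over all entries of X (assumed).
    """
    X = np.asarray(X, dtype=float)
    Y = X if Y is None else np.asarray(Y, dtype=float)
    m = X.shape[1]
    c = r * m * X.var()
    # Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / c)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))      # 6 hypothetical compounds, 3 descriptors
K = rbf_kernel_matrix(X)
print(K.shape)
```

The resulting matrix is symmetric with ones on the diagonal; kernel PLS then extracts latent variables from this matrix instead of from the raw descriptor matrix.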
Figure 2. Plot of predicted RT obtained by GA-KPLS against the experimental values
Results of the L-M ANN model
The networks were generated using the descriptors appearing in the GA-PLS model as inputs. For ANN generation, the data set was separated into three groups: calibration, prediction and test sets. Before training, the input and output values were normalized between 0 and 1. The number of neurons in the hidden layer, the learning rate and the momentum were optimized. A feed-forward neural network with the back-propagation algorithm was constructed to model the retention relationship [35]. This method is an iterative algorithm that allows the training of multilayer networks. The algorithm looks for the minimum of the error function, so the training process tries to reduce the difference between the outputs of the network and the expected values. There are other training approaches, such as the Levenberg-Marquardt algorithm, gradient descent with variable learning rate back-propagation and resilient back-propagation. These approaches differ in their weight update functions and can converge faster than the steepest descent method [36]. However, this paper does not investigate the role of weight update functions or calculation time in artificial neural networks. Our network has nine input nodes, four hidden nodes and one output node. A bias unit with a constant activation of unity is connected to each unit in the hidden and output layers. Once the best topology of the network was obtained and the convergence criterion was reached, a leave-4-out cross-validation procedure was also employed to further validate the performance of the resulting networks. To evaluate the performance of the ANN, the RMSE of calibration was used. The number of neurons in the hidden layer with the minimum RMSE was selected as the optimum number. The learning rate and momentum were optimized in a similar way. The RMSE for the training and test sets was found to be at a minimum when four neurons were used in the hidden layer.
The R2 and RMSE for the calibration, prediction and test sets were (0.945, 0.929, 0.861) and (0.165, 0.353, 0.522), respectively. Inspection of the results reveals a higher R2 and lower values of the other parameters for the test set compared with their counterparts for the other models. Plots of predicted RT versus experimental RT values by L-M ANN for the calibration and prediction sets and for the test set are shown in Figure 3a and 3b, respectively.
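The scaling step and the network topology can be sketched as follows, assuming column-wise min-max scaling and sigmoid activations in both the hidden and output layers. The weights are random placeholders, not the trained values, and all names are illustrative.

```python
import numpy as np


def minmax_scale(a):
    """Scale each column of a to the range [0, 1]."""
    a = np.asarray(a, dtype=float)
    lo, hi = a.min(axis=0), a.max(axis=0)
    return (a - lo) / (hi - lo)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, W1, b1, W2, b2):
    """Forward pass of a network with one hidden layer.

    b1 and b2 play the role of the bias unit with constant activation 1.
    """
    hidden = sigmoid(X @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)


rng = np.random.default_rng(0)
X = minmax_scale(rng.random((5, 9)))             # 5 hypothetical molecules, 9 inputs
W1, b1 = rng.normal(size=(9, 4)), np.zeros(4)    # 9 inputs -> 4 hidden neurons
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # 4 hidden neurons -> 1 output
rt_scaled = forward(X, W1, b1, W2, b2)
print(rt_scaled.ravel())
```

The output lies between 0 and 1, so a predicted retention time is recovered by inverting the scaling applied to the experimental values.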
Figure 3. Plot of predicted RT obtained by L-M ANN against the experimental values for (a) the calibration and prediction sets of molecules and (b) the test set
The residuals (predicted RT− experimental RT) obtained by the L-M ANN modeling versus the experimental RT values are shown in Figure 4a, 4b. As the calculated residuals are distributed on both sides of the zero line, one may conclude that there is no systematic error in the development of the neural network.
Figure 4. Plot of residuals obtained by L-M ANN against the experimental RT values (a) training set of molecules and (b) for test set
The experimental and calculated RT values and the RMSE obtained by the L-M ANN model are shown in Table 1 and Table 2 for the training and test sets. The Q2 values of the training and test sets for the GA-PLS and GA-KPLS models are (0.802, 0.734) and (0.775, 0.712), respectively, compared with (0.943, 0.924, 0.853) for the calibration, prediction and test sets of the L-M ANN model. Comparison of these values and the other statistical parameters reveals the superiority of the L-M ANN model over the other models. The key strength of neural networks, unlike regression analysis, is their ability to map the selected features flexibly by manipulating their functional dependence implicitly. The statistical parameters reveal the high predictive ability of the L-M ANN model.
Table 1. The data set, structure, and the corresponding observed, calculated and root mean square error values of retention time for the training set of the L-M ANN model
No |
Name |
Molecular formula |
RTExp |
RTCal |
RMSE |
|
Calibration Set |
|
|
|
|
1 |
Dichlorvos |
C4H7Cl2O4P |
7.85 |
7.41 |
0.047 |
2 |
Mevinphos |
C7H13O6P |
12.08 |
12.45 |
0.039 |
3 |
Acenaphthene |
C12H10 |
13.25 |
12.06 |
0.125 |
4 |
Methacrifos |
C7H13O5PS |
13.8 |
13.05 |
0.079 |
5 |
Heptenophos |
C9H12ClO4P |
15.45 |
15.97 |
0.055 |
6 |
Fluorene |
C13H10 |
15.47 |
14.89 |
0.062 |
7 |
Tecnazene |
C6HCl4NO2 |
15.95 |
15.33 |
0.065 |
8 |
Diphenylamine |
C12H11N |
16.33 |
17.05 |
0.076 |
9 |
Chlorpropham |
C10H12ClNO2 |
17.08 |
18.63 |
0.164 |
10 |
Terbumeton desethyl |
C8H15N5O |
17.18 |
15.91 |
0.134 |
11 |
Atrazine desethyl |
C6H10ClN5 |
17.28 |
15.79 |
0.157 |
12 |
Trifluraline |
C13H16F3N3O4 |
17.79 |
16.46 |
0.141 |
13 |
Hexachlorobenzene |
C6Cl6 |
18.3 |
17.08 |
0.129 |
14 |
Dimethoate |
C5H12NO3PS2 |
18.68 |
20.22 |
0.162 |
15 |
Atrazine |
C8H14ClN5 |
19.2 |
17.80 |
0.147 |
16 |
Lindane |
C6H6Cl6 |
19.39 |
19.84 |
0.048 |
17 |
Terbumeton |
C10H19N5O |
19.47 |
18.57 |
0.095 |
18 |
Phenanthrene |
C14H10 |
19.72 |
18.62 |
0.116 |
19 |
Fonofos |
C10H15OPS2 |
19.8 |
20.64 |
0.089 |
20 |
Propyzamide |
C12H11Cl2NO |
19.92 |
20.12 |
0.021 |
21 |
Diazinon |
C12H21N2O3PS |
20.37 |
22.17 |
0.190 |
22 |
Terbacil |
C9H13ClN2O2 |
20.54 |
19.84 |
0.074 |
23 |
Endosulfan ether |
C9H6Cl6O |
21.04 |
21.90 |
0.091 |
24 |
Pirimicarb |
C11H18N4O2 |
21.35 |
23.24 |
0.200 |
25 |
PCB 28 |
C12H7Cl3 |
21.69 |
22.04 |
0.037 |
26 |
Chlorpyrifos methyl |
C7H7Cl3NO3PS |
21.95 |
23.62 |
0.176 |
27 |
Parathion methyl |
C8H10NO5PS |
22.05 |
23.15 |
0.116 |
28 |
Chlozolinate |
C13H11Cl2NO5 |
22.08 |
21.55 |
0.056 |
29 |
Alachlor |
C14H20ClNO2 |
22.39 |
23.40 |
0.107 |
30 |
Fenchlorphos |
C8H8Cl3O3PS |
22.62 |
21.09 |
0.161 |
31 |
Metalaxyl |
C15H21NO4 |
22.63 |
24.69 |
0.218 |
32 |
Methiocarb sulfone |
C11H15NO4S |
22.92 |
21.53 |
0.147 |
33 |
Methiocarb |
C11H15NO2S |
23.14 |
24.42 |
0.135 |
34 |
Fenitrothion |
C9H12NO5PS |
23.17 |
23.70 |
0.055 |
35 |
Pirimiphos methyl |
C11H20N3O3PS |
23.32 |
21.74 |
0.166 |
36 | Dichlofluanide | C9H11Cl2FN2O2S2 | 23.42 | 21.22 | 0.232 |
37 | Metolachlor | C15H22ClNO2 | 23.79 | 22.78 | 0.106 |
38 | Fenthion | C10H15O3PS2 | 23.92 | 26.17 | 0.238 |
39 | Chlorpyrifos | C9H11Cl3NO3PS | 24 | 26.16 | 0.228 |
40 | Isodrin | C12H8Cl6 | 24.62 | 23.28 | 0.141 |
41 | Cyprodinil | C14H15N3 | 24.95 | 27.24 | 0.241 |
42 | Heptachlor epoxide B | C10H5Cl7O | 25.09 | 25.99 | 0.095 |
43 | Fluoranthene | C16H10 | 25.2 | 22.96 | 0.236 |
44 | Heptachlor epoxide A | C10H5Cl7O | 25.25 | 25.75 | 0.053 |
45 | Chlorfenvinphos | C12H14Cl3O4P | 25.57 | 25.02 | 0.058 |
46 | Isofenphos | C15H24NO4PS | 25.6 | 24.14 | 0.154 |
47 | Procymidone | C13H11Cl2NO2 | 25.85 | 24.21 | 0.173 |
48 | Methidathion | C6H11N2O4PS3 | 26.12 | 23.91 | 0.233 |
49 | Fenoxycarb | C17H19NO4 | 26.37 | 23.77 | 0.274 |
50 | α-Endosulfan | C9H6Cl6O3S | 26.42 | 27.95 | 0.161 |
51 | PCB 77 | C12H6Cl4 | 27.32 | 24.69 | 0.278 |
52 | Dieldrin | C12H8Cl6O | 27.39 | 29.46 | 0.218 |
53 | PCB 81 | C12H6Cl4 | 27.69 | 26.87 | 0.086 |
54 | Buprofezin | C16H23N3OS | 27.87 | 29.11 | 0.131 |
55 | Bupirimate | C13H24N4O3S | 28.07 | 30.70 | 0.277 |
56 | β-Endosulfan | C9H6Cl6O3S | 28.52 | 26.56 | 0.207 |
57 | BDE 28 | C12H7OBr3 | 28.68 | 26.75 | 0.204 |
58 | p,p'-DDD | C14H10Cl4 | 28.97 | 27.89 | 0.114 |
59 | Oxadixyl | C14H18N2O4 | 29.15 | 26.41 | 0.289 |
60 | PCB 153 | C12H4Cl6 | 29.47 | 28.63 | 0.089 |
61 | PCB 123 | C12H5Cl5 | 29.59 | 26.79 | 0.295 |
62 | p,p'-DDT | C14H9Cl5 | 30.3 | 32.08 | 0.188 |
63 | PCB 126 | C12H5Cl5 | 30.75 | 33.82 | 0.324 |
64 | Tebuconazole | C16H22ClN3O | 30.8 | 28.40 | 0.253 |
65 | PCB 156 | C12H4Cl6 | 31.45 | 32.22 | 0.081 |
66 | Benzo(a)anthracene | C18H12 | 31.84 | 29.50 | 0.247 |
67 | Phosmet | C11H12NO4PS2 | 32.08 | 30.86 | 0.129 |
68 | PCB 157 | C12H4Cl6 | 32.24 | 29.89 | 0.248 |
69 | Bifenthrin | C23H22ClF3O2 | 32.39 | 30.60 | 0.189 |
70 | PCB 167 | C12H4Cl6 | 32.44 | 31.77 | 0.071 |
71 | PCB 180 | C12H3Cl7 | 32.84 | 30.44 | 0.253 |
72 | BDE 47 | C12H6OBr4 | 32.92 | 31.22 | 0.180 |
73 | Tetradifon | C12H6Cl4O2S | 33.07 | 35.90 | 0.298 |
74 | PCB 169 | C12H4Cl6 | 33.55 | 32.82 | 0.077 |
75 | Mirex | C10Cl12 | 33.62 | 33.70 | 0.008 |
76 | Fenarimol | C17H12Cl2N2O | 34.39 | 31.59 | 0.295 |
77 | PCB 189 | C12H3Cl7 | 34.82 | 31.49 | 0.351 |
78 | Permethrin II | C21H20Cl2O3 | 35.9 | 33.46 | 0.258 |
79 | Coumaphos | C14H16ClO5PS | 36.02 | 38.50 | 0.262 |
80 | Benzo(b)fluoranthene | C20H12 | 36.55 | 39.32 | 0.292 |
81 | Cypermethrin I | C22H19Cl2NO3 | 37.42 | 35.08 | 0.247 |
82 | Cypermethrin II | C22H19Cl2NO3 | 37.62 | 41.32 | 0.390 |
83 | Cypermethrin IV | C22H19Cl2NO3 | 37.79 | 39.39 | 0.169 |
84 | Benzo(a)pyrene | C20H12 | 37.81 | 40.62 | 0.296 |
85 | Fenvalerate I | C25H22ClNO3 | 39.15 | 38.99 | 0.017 |
86 | BDE 154 | C12H4OBr6 | 39.17 | 39.14 | 0.003 |
87 | τ-Fluvalinate II | C26H22ClF3N2O3 | 39.7 | 38.01 | 0.178 |
88 | BDE 153 | C12H4OBr6 | 40.3 | 36.76 | 0.373 |
89 | Indeno(1,2,3-cd)pyrene | C22H12 | 41.89 | 39.90 | 0.210 |
90 | Dibenzo(a,h)anthracene | C22H14 | 42.07 | 39.43 | 0.278 |
Prediction Set |
91 | Methamidophos | C2H8NO2PS | 7.35 | 8.40 | 0.191 |
92 | Pentachlorobenzene | C6HCl5 | 14.09 | 14.40 | 0.056 |
93 | Omethoate | C5H12NO4PS | 15.72 | 17.36 | 0.300 |
94 | Atrazine desisopropyl | C5H8ClN5 | 16.98 | 18.01 | 0.187 |
95 | Phorate | C7H17O2PS3 | 17.97 | 17.55 | 0.077 |
96 | 4-n-Octylphenol | C14H22O | 19.44 | 20.71 | 0.232 |
97 | Anthracene | C14H10 | 19.92 | 19.71 | 0.038 |
98 | 4-n-Nonylphenol | C15H24O | 21.57 | 24.47 | 0.530 |
99 | Carbaryl | C12H11NO2 | 22.22 | 22.71 | 0.089 |
100 | Terbutryn | C10H19N5S | 23.09 | 21.31 | 0.326 |
101 | Malathion | C10H19O6PS2 | 23.67 | 25.29 | 0.296 |
102 | Pirimiphos ethyl | C13H24N3O3PS | 24.9 | 25.62 | 0.131 |
103 | Thiabendazole | C10H7N3S | 25.3 | 24.17 | 0.206 |
104 | Quinalphos | C12H15N2O3PS | 25.65 | 27.73 | 0.381 |
105 | Pyrene | C16H10 | 26.15 | 23.98 | 0.396 |
106 | Imazalil | C14H14Cl2N2O | 27.2 | 31.04 | 0.700 |
107 | p,p'-DDE | C14H8Cl4 | 27.45 | 26.13 | 0.241 |
108 | PCB 118 | C12H5Cl5 | 28.64 | 26.93 | 0.312 |
109 | Ethion | C9H22O4P2S4 | 29.24 | 30.98 | 0.318 |
110 | PCB 138 | C12H4Cl6 | 30.45 | 27.41 | 0.556 |
111 | Iprodione | C13H13Cl2N3O3 | 31.89 | 31.38 | 0.094 |
112 | BDE 71 | C12H6OBr4 | 32.4 | 31.39 | 0.185 |
113 | BDE 66 | C12H6OBr4 | 33.47 | 30.50 | 0.543 |
114 | Pyrazophos | C14H20N3O5PS | 34.74 | 32.08 | 0.485 |
115 | BDE 100 | C12H5OBr5 | 35.95 | 36.04 | 0.016 |
116 | BDE 99 | C12H5OBr5 | 36.8 | 40.64 | 0.700 |
117 | BDE 85 | C12H5OBr5 | 38.35 | 42.86 | 0.824 |
118 | τ-Fluvalinate I | C26H22ClF3N2O3 | 39.57 | 35.84 | 0.682 |
119 | BDE 138 | C12H4OBr6 | 41.85 | 45.27 | 0.625 |
120 | Benzo(g,h,i)perylene | C22H12 | 42.69 | 47.47 | 0.873 |
Table 2. Compounds, molecular formulas, observed and calculated RT values, and RMSE for the test set obtained by L-M ANN
No | Name | Molecular formula | RTExp | RTCal | RMSE |
1 | Naphthalene | C10H8 | 6.5 | 7.66 | 0.212 |
2 | Acenaphthylene | C12H8 | 12.43 | 12.52 | 0.016 |
3 | Molinate | C9H17NOS | 15.38 | 14.83 | 0.101 |
4 | 4-t-Octylphenol | C14H22O | 15.99 | 18.37 | 0.434 |
5 | Terbuthylazine desethyl | C7H12ClN5 | 17.68 | 17.59 | 0.016 |
6 | Simazine | C7H12ClN5 | 18.95 | 15.64 | 0.604 |
7 | Terbuthylazine | C9H16ClN5 | 19.77 | 23.80 | 0.735 |
8 | Etrimfos | C10H17N2O4PS | 20.92 | 16.50 | 0.807 |
9 | Fosfamidon | C10H19ClNO5P | 21.78 | 25.09 | 0.604 |
10 | Heptachlor | C10H5Cl7 | 22.2 | 26.02 | 0.697 |
11 | PCB 52 | C12H6Cl4 | 23.05 | 28.83 | 1.056 |
12 | Aldrin | C12H8Cl6 | 23.52 | 21.05 | 0.451 |
13 | Parathion ethyl | C10H14NO5PS | 24.02 | 22.75 | 0.231 |
14 | Penconazole | C13H15Cl2N3 | 25.25 | 29.31 | 0.742 |
15 | Hexythiazox | C17H21ClN2O2S | 26.04 | 31.87 | 1.064 |
16 | Profenofos | C11H15BrClO3PS | 27.35 | 26.30 | 0.193 |
17 | PCB 105 | C12H5Cl5 | 28.55 | 26.14 | 0.441 |
18 | PCB 114 | C12H5Cl5 | 29.04 | 31.15 | 0.384 |
19 | Endosulfan sulfate | C9H6Cl6O4S | 30.09 | 28.10 | 0.364 |
20 | Diflufenican | C19H11F5N2O2 | 31.14 | 35.31 | 0.762 |
21 | Chrysene | C18H12 | 32.02 | 39.80 | 1.420 |
22 | Methoxychlor | C16H15Cl3O2 | 32.42 | 34.52 | 0.383 |
23 | Phosalone | C12H15ClNO4PS2 | 33.44 | 28.64 | 0.876 |
24 | λ-Cyhalothrin | C23H19ClF3NO3 | 34.34 | 31.73 | 0.477 |
25 | Permethrin I | C21H20Cl2O3 | 35.65 | 32.60 | 0.557 |
26 | Benzo(k)fluoranthene | C20H12 | 36.65 | 38.49 | 0.337 |
27 | Cypermethrin III | C22H19Cl2NO3 | 37.79 | 37.99 | 0.036 |
28 | Fenvalerate II | C25H22ClNO3 | 39.55 | 39.72 | 0.031 |
29 | Deltamethrin | C22H19Br2NO3 | 40.55 | 34.50 | 1.105 |
30 | BDE 183 | C12H3OBr7 | 43.65 | 40.81 | 0.519 |
Taken together, these data show a clear improvement of the QSRR model when nonlinear statistical treatment is applied. The experimental and predicted RT values agree closely, and the points scatter very little around a straight line whose slope and intercept are close to one and zero, respectively. As seen in this section, the L-M ANN is more reproducible than GA-PLS and GA-KPLS for modeling the retention time of organic contaminants in natural water and wastewater.
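The slope and intercept criterion can be checked with an ordinary least-squares fit of calculated against experimental RT. The following is a minimal Python sketch, not the authors' procedure. The helper name `fit_line` is hypothetical, and the five data pairs are simply the first five test-set rows of Table 2, used here for illustration only.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# First five test-set rows of Table 2 (illustrative subset, not the full set).
rt_exp = [6.5, 12.43, 15.38, 15.99, 17.68]
rt_cal = [7.66, 12.52, 14.83, 18.37, 17.59]

slope, intercept = fit_line(rt_exp, rt_cal)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A slope near one with an intercept near zero indicates that the calculated values track the experimental ones without systematic bias; the fit on a five-row subset says nothing about the full data set.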
Model validation and statistical parameters
Model validation
Validation is a crucial aspect of any QSPR/QSRR modeling. The accuracy of the proposed models was assessed with two evaluation techniques: the leave-group-out cross validation (LGO-CV) procedure and validation through an external test set.
Cross validation technique
Cross validation is a popular technique for exploring the reliability of statistical models. In this technique, many modified data sets are created by deleting, in each case, one object or a small group of objects (leave-some-out). For each data set, an input–output model is developed with the chosen modeling technique. Each model is then evaluated by measuring its accuracy in predicting the responses of the remaining data (the object or group that was not used to develop the model) [37]. The LGO procedure was used in this study: a group of compounds was removed from the data set, a QSRR model was constructed on the reduced data set, and the model was then used to predict the removed data. This procedure was repeated until a complete set of predicted values was obtained. The statistical significance of the screened model was judged by the cross-validated correlation coefficient (Q2).
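A leave-group-out loop of the kind described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a one-descriptor least-squares line stands in for the GA-PLS, GA-KPLS and L-M ANN models, the number of groups (five) and the random seed are arbitrary, the data are synthetic, and the names `fit_linear` and `lgo_q2` are hypothetical.

```python
import random
from typing import Callable, List, Sequence


def fit_linear(x: Sequence[float], y: Sequence[float]) -> Callable[[float], float]:
    """Stand-in model (assumption): least-squares line y = a*x + b on one descriptor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda value: a * value + b


def lgo_q2(x: Sequence[float], y: Sequence[float],
           n_groups: int = 5, seed: int = 0) -> float:
    """Leave-group-out cross validation; returns Q2 = 1 - PRESS / SS_total."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    # Every compound lands in exactly one held-out group.
    groups: List[List[int]] = [indices[g::n_groups] for g in range(n_groups)]
    predicted = [0.0] * len(x)
    for held_out in groups:
        left_out = set(held_out)
        kept = [i for i in indices if i not in left_out]
        # Build the model on the reduced data set, then predict the removed group.
        model = fit_linear([x[i] for i in kept], [y[i] for i in kept])
        for i in held_out:
            predicted[i] = model(x[i])
    mean_y = sum(y) / len(y)
    press = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_total


# Synthetic descriptor/RT pairs, for demonstration only (not from the paper).
descriptor = [float(i) for i in range(20)]
rt = [2.0 * d + 1.0 + (0.5 if i % 2 else -0.5) for i, d in enumerate(descriptor)]
print(f"LGO Q2 = {lgo_q2(descriptor, rt):.3f}")
```

Because each compound is predicted only by a model that never saw it, the resulting Q2 reflects internal predictivity rather than goodness of fit.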
Cross validation results are widely judged in the literature by the Q2 value. In this sense, a high value of this statistic (Q2 > 0.5) is considered proof of the high predictive ability of the model. However, this assumption is incorrect in many cases: the lack of correlation between a high LGO Q2 and a high predictive ability of QSRR models has been established and corroborated recently [38]. Thus, a high value of LGO-CV Q2 appears to be a necessary but not a sufficient condition for a model to have high predictive power. These authors stated that an external set is necessary. As a next step, the developed QSRR model was therefore also used to predict the property of a new set of compounds.
Validation through the external validation set
Validating a QSRR model with external data (i.e., data not used in model development) is the best method of validation. However, an independent external validation set of several compounds is rarely available in QSRR. Thus, the predictive ability of a QSRR model with the selected descriptors was further explored by dividing the full data set: the predictive power of the models developed on the selected training set is estimated from the predicted values of the test set chemicals. After sorting by RT value, the data set was randomly divided into three groups: calibration and prediction sets (together forming the training set) and a test set. The calibration set was used for model generation. The prediction set was used to guard against overfitting of the network, whereas the test set, whose molecules play no role in model building, was used to evaluate the predictive ability of the models for external compounds. The calibration set consisted of 90 molecules, the prediction set of 30 molecules, and the test set of 30 molecules. Taken together, these data show a clear improvement of the QSRR model when non-linear statistical treatment is applied, and a substantial independence of model prediction from the structure of the test molecule. In the above analysis, the descriptive power of a given model has been measured by its ability to predict the retention of unknown molecules. For instance, Figure 3 shows that the data points of the test set scatter only slightly from the ideal trend.
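The text does not spell out how random assignment was combined with sorting by RT. One plausible reading, sketched below in Python, is to sort the compounds by RT and then, within each consecutive block of five, assign three at random to the calibration set, one to the prediction set and one to the test set. This block-wise scheme, the function name `split_by_sorted_rt` and the seed are assumptions made for illustration; for 150 compounds the scheme reproduces the 90/30/30 sizes reported above.

```python
import random
from typing import Dict, List, Sequence


def split_by_sorted_rt(rt: Sequence[float], seed: int = 0) -> Dict[str, List[int]]:
    """Split compound indices 3:1:1 within consecutive RT-sorted blocks of five."""
    order = sorted(range(len(rt)), key=lambda i: rt[i])
    rng = random.Random(seed)
    sets: Dict[str, List[int]] = {"calibration": [], "prediction": [], "test": []}
    for start in range(0, len(order), 5):
        block = order[start:start + 5]
        rng.shuffle(block)  # random assignment inside a narrow RT window
        sets["calibration"].extend(block[:3])
        sets["prediction"].extend(block[3:4])
        sets["test"].extend(block[4:5])
    return sets


# Synthetic retention times for 150 compounds, for demonstration only.
rng_demo = random.Random(1)
demo_rt = [rng_demo.uniform(6.0, 44.0) for _ in range(150)]
demo_sets = split_by_sorted_rt(demo_rt)
print({name: len(members) for name, members in demo_sets.items()})
```

Splitting inside RT-sorted blocks keeps all three sets spread over the whole retention range, which a purely random split does not guarantee.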
Statistical parameters
For the constructed models, some general statistical parameters were selected to evaluate the predictive ability of the models for RT values. In this case, the predicted RT of each sample in the prediction step was compared with the experimental RT.
Root mean square error (RMSE) measures the average difference between predicted and experimental values at the prediction step. It can be interpreted as the average prediction error, expressed in the same units as the original response values. A small value indicates that the model predicts better than chance and can be considered statistically significant. The RMSE was obtained by the following formula:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad (5)$$
The other statistical parameter was the relative error (RE), which shows the predictive ability of each component and is calculated as:
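$$\mathrm{RE}\,(\%) = 100 \times \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{y_i} \qquad (6)$$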
The predictive ability was evaluated by the cross validation coefficient (Q2 or R2cv), which is based on the prediction error sum of squares (PRESS) and was calculated by the following equation:
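$$Q^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (7)$$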
where $y_i$ is the experimental RT of sample i, $\hat{y}_i$ is the predicted RT of sample i, $\bar{y}$ is the mean of the experimental RT values in the prediction set, and n is the total number of samples used in the test set [39, 40].
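The three statistics can be computed directly from paired experimental and predicted RT values. The Python sketch below assumes the usual definitions: RMSE as the root of the mean squared deviation, RE as the mean absolute relative deviation in percent (an assumed form), and Q2 as one minus PRESS divided by the total sum of squares about the experimental mean. All function names are hypothetical. The last helper reflects an observation rather than a statement from the source: the per-compound RMSE column in the tables appears consistent with |RTCal − RTExp|/√n, with n equal to the size of the respective set.

```python
import math
from typing import Sequence


def rmse(y_exp: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error, in the units of the response (minutes for RT)."""
    n = len(y_exp)
    return math.sqrt(sum((p - e) ** 2 for e, p in zip(y_exp, y_pred)) / n)


def relative_error_percent(y_exp: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed form: mean of |predicted - experimental| / experimental, in percent."""
    n = len(y_exp)
    return 100.0 * sum(abs(p - e) / e for e, p in zip(y_exp, y_pred)) / n


def q2(y_exp: Sequence[float], y_pred: Sequence[float]) -> float:
    """Q2 = 1 - PRESS / SS_total, with SS_total taken about the experimental mean."""
    mean_exp = sum(y_exp) / len(y_exp)
    press = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_total = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - press / ss_total


def per_compound_term(rt_exp: float, rt_cal: float, n: int) -> float:
    """|RTCal - RTExp| / sqrt(n): apparent form of the tabulated per-compound value."""
    return abs(rt_cal - rt_exp) / math.sqrt(n)


# First five test-set rows of Table 2 (illustrative subset, not the full set).
rt_exp = [6.5, 12.43, 15.38, 15.99, 17.68]
rt_cal = [7.66, 12.52, 14.83, 18.37, 17.59]

print(f"RMSE = {rmse(rt_exp, rt_cal):.3f}")
print(f"RE   = {relative_error_percent(rt_exp, rt_cal):.2f} %")
print(f"Q2   = {q2(rt_exp, rt_cal):.3f}")
# Naphthalene, test set of 30 compounds: matches the 0.212 listed in Table 2.
print(round(per_compound_term(6.5, 7.66, 30), 3))
```

Values computed on this five-row subset are not the statistics of the full test set; they only show how the quantities are obtained.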
Conclusion
Organic pollutants in water can harm the environment and pose health risks for humans. They pose special risks because they are often not broken down naturally and can remain in water sources for decades or longer. The analysis of organic pollutants in water allows managers to assess the quality and safety of water sources. GA-PLS, GA-KPLS and L-M ANN modeling were applied to predict the retention time of 150 organic contaminants in natural water and wastewater. High correlation coefficients and low prediction errors confirmed the good predictability of the models. Application of the developed model to a validation set of 30 compounds demonstrates that the new model is reliable, with good predictive accuracy and a simple formulation. All three methods seemed to be useful, although a comparison between them revealed a slight superiority of the L-M ANN over the other models.
Talanta. 2011, 83:1014
[31] Jalali-Heravi M., Asadollahi-Baboli M., Shahbazikhah P., Eur. J. Med. Chem., 2008, 43:548